Abstract

We present a novel attribute learning framework named
Hypergraph-based Attribute Predictor (HAP). In HAP, a hypergraph
is leveraged to depict the attribute relations in the data. The
attribute prediction problem is then cast as a regularized
hypergraph cut problem, in which HAP jointly learns a collection
of attribute projections from the feature space to a hypergraph
embedding space aligned with the attribute space. The learned
projections directly act as attribute classifiers (linear and
kernelized). This formulation leads to a very efficient approach.
By considering our model as a multi-graph cut task, our framework
can flexibly incorporate other available information, in
particular class labels. We apply our approach to attribute
prediction, Zero-shot and N-shot learning tasks. The results on
the AWA, USAA and CUB databases demonstrate the value of our
methods in comparison with the state-of-the-art approaches.

1. Introduction 

Attribute learning aims at achieving an intermediate
representation on top of the low-level visual feature space,
which encodes semantic properties shared across different
categories of objects or scenes. Such an intermediate
representation serves as a vehicle of semantics in human-machine
communication. Farhadi et al. [8] and Lampert et al. [18] showed
that supervised attributes can be transferred across object
categories, allowing description and naming of objects from
categories not seen during training. Attributes therefore provide
a way to encode and share knowledge for challenging tasks such as
Zero-Shot Learning (ZSL), where the goal is to categorize classes
that are unseen during training [18, 17].

Many approaches have been proposed for attribute learning,
e.g. [18, 8, 27, 21, 1, 14, 10, 20]. Two fundamental issues
remain unsolved, although recent work has started to pay
attention to them, e.g. [1, 14, 10]. First, traditional
approaches learn attributes independently from each other
(one-vs-all classifiers) [18, 8], without explicitly exploiting
the correlation between attributes. Second, learning attribute
classifiers is typically done independently of the subsequent
tasks, such as categorization or zero-shot learning, and category
labels are typically ignored in the learning process. Optimizing
attribute prediction independently of the succeeding task is not
guaranteed to yield the best attribute predictor for that task.

Figure 1. The visualization of the hypergraph cut for predicting
the attribute “water”. Each ellipse with solid lines denotes a
hyperedge that encodes an attribute relation and each circle
represents a sample.

Correlations naturally exist among attributes. Is it better to
exploit the correlation or to discourage it during attribute
learning, i.e. decorrelation? Several papers have argued that
exploiting correlation between attributes improves their
discriminative power, e.g. [31, 28]. Recent works attempted to
address joint attribute learning with a focus on decorrelating
attributes [1, 14]. Jayaraman et al. [14] argued that attribute
learning approaches are prone to learn visual features that
correlate with attributes, not the attributes themselves. They
therefore argued for decorrelation of attributes, exploiting
feature competition during learning through a multitask learning
framework. Decorrelation of attributes might be suitable in tasks
such as describing images with text, keyword-based retrieval, or
generating image annotations. However, preserving and exploiting
correlation between attributes should preserve the natural
clustering in the data and should be better for classification
and zero-shot learning tasks. Correlation is inherent to
attributes, and forcibly decorrelating them may break the
original relations of attributes in the visual space as well as
in the semantic space.

To illustrate our point we use the example in Figure 1, where
we use attributes from the AWA dataset [18]. Consider learning
the attribute “water”. In the figure, there are two clusters
denoted as two superclasses (terrestrial and aquatic animals).
It is expected that the attributes in each cluster will be highly
correlated and coexist in images. Conventional classifiers
directly learn an optimal separation for the attribute “water”,
which clearly ignores the correlations as well as the natural
clusters in the data. Although this achieves the optimal
attribute prediction, it reduces the utility of the attribute
predictors in subsequent tasks such as categorization, and it is
much more prone to overfitting. Instead we aim at a cut that,
besides minimizing the attribute prediction loss, tries to
preserve the clustering in the data. Preserving the correlation
can be even more beneficial for attributes that are not visual.
For example, the attribute “swim” describes an action, and it is
very hard to predict from visual features if it is forced to be
decorrelated from other attributes such as “water”.

Our goal is to design a new attribute learning framework that
addresses the two aforementioned issues, i.e. jointly learning
attributes while exploiting the correlation between attributes,
and exploiting class information as well as any available side
information. We propose to model attribute learning as a
supervised hypergraph cut problem. As a generalization of graphs,
hypergraphs are typically used to depict high-order and multiple
relations of data [35, 5, 15, 29]. One merit of a hypergraph is
that it can capture the correlations of multiple relations, since
partitioning a vertex set that shares many common hyperedges
incurs a heavy penalty [29]. In our formulation, we define a
hypergraph where each vertex corresponds to a sample and a
hyperedge is a vertex set sharing the same attribute label. We
can then consider the attribute prediction problem as a
hypergraph cut problem. More specifically, a collection of
hypergraph cuts (one cut per attribute) that minimizes the loss
of attribute relations (defined by the hyperedges) is jointly
learned. Moreover, such cuts also minimize the attribute
prediction errors on the training data. Since hypergraph cuts can
be viewed as a hypergraph embedding from the perspective of graph
embedding [26, 3, 32], this step actually tries to align the
embedding space, which encodes the attribute relations, with the
semantic attribute space. We also propose attribute predictors
(or classifiers) that can be obtained by introducing a mapping
from the feature space to this aligned hypergraph embedding
space. We name this approach Hypergraph-based Attribute Predictor
(HAP), which can be combined with different hypergraph models to
obtain classifiers from the cuts. We illustrate our model in
Figure 2.

Figure 2. Overview of our approach. We learn a collection of
mappings (projections) from the feature space to the hypergraph
embedding space, which is aligned with the attribute space and
encodes the attribute relations. The learned projections span the
Attribute Prediction Space (APS), in which each basis is an
attribute predictor. The attribute prediction of a sample is
obtained by projecting it to the APS, after which attribute-based
categorization can be performed.

In order to incorporate class information and any additional
side information within the HAP formulation, we consider it as a
multi-graph cut problem. One or several additional graphs (or
hypergraphs), which encode side information, are introduced as
penalties to HAP. In that case, the new cuts should minimize not
only the loss of attribute relations, but also the losses of the
side information. In the case of class labels, we formulate a new
HAP framework, denoted as Class-Specific HAP (CSHAP), which
enhances the discriminating ability of the attribute predictors.
A hypergraph and a graph, which encode the class information in
two different ways, are respectively leveraged to produce two
different versions of the CSHAP approach. We denote them
Hypergraph-based CSHAP (CSHAP_H) and Graph-based CSHAP (CSHAP_G)
respectively. Finally, all the proposed approaches are kernelized
to incorporate nonlinearity.

We summarize the contributions of this paper as follows: 

1. As far as we know, our approach is the first to formulate
attribute learning as a supervised hypergraph cut problem.

2. We propose an approach to construct predictors (linear and
nonlinear classifiers) jointly while solving for the cuts. This
idea is applicable in general (not limited to the context of
attribute learning) to any hypergraph cut algorithm, as a general
path to derive classifiers from a graph model.

3. We provide a flexible solution to incorporate side
information in attribute learning.

4. The proposed approach provides efficient attribute
predictions, since the computational complexity of attribute
prediction is linear with respect to the dimension of the feature.

We experimented with three datasets: Animal With Attributes
(AWA) [18], Caltech-UCSD Birds (CUB) [30] and Unstructured Social
Activity Attribute (USAA) [10]. The results on attribute
prediction, Zero-shot learning, N-shot learning, and
categorization consistently validate the effectiveness of the
proposed framework. The rest of the paper is organized as follows:
Section 2 presents the related works; Section 3 describes the
proposed approach; Section 4 shows the experimental evaluation of
our work; the conclusion is summarized in Section 5.

2. Related Works 
2.1. Attributes 

Traditional attribute learning approaches follow a supervised
discriminative learning pipeline (one-vs-all classifiers) where
attribute classifiers are learned independently given attribute
labels for each image or each class [8, 18]. Recently, several
papers suggested approaches for joint learning of attributes
[19, 31, 1, 14]. Wang et al. [31] and Song et al. [28] construct
a graph of attributes in the attribute domain from the training
data and treat it as the latent variables in a latent SVM for
categorization; they showed that exploiting attribute relations
helps improve class prediction. In contrast, our hypergraph is
constructed on the samples, which facilitates aligning the
feature space with the attribute semantic space.

Mahajan et al. [19] proposed a joint learning framework that
removes correlations as redundancies while learning the mapping
between attributes and classes, which effectively ignores the
contributions of co-occurring attributes. Akata et al. [1]
proposed an approach that simultaneously targets three problems:
optimizing attribute prediction using class labels, using side
information, and incremental learning. They achieved
decorrelation by using dimensionality reduction on the
class-attribute matrix, i.e. in the label space, and showed that
the attribute dimensions can be reduced significantly without
affecting the accuracy. In contrast, our hypergraph construction
achieves correlation/decorrelation by employing the
sample-attribute relations and embedding the data samples in a
space that is aligned with the attribute labels. Recently,
Jayaraman et al. [14] attempted to decorrelate the attributes by
solving a structured sparsity model in which semantic attribute
groups are manually provided as auxiliary data.

Attribute learning is just the preliminary step for some other
visual tasks. A few approaches have been proposed to incorporate
additional information, in particular class labels, into
attribute learning to benefit the subsequent tasks
[8, 1, 31, 4, 33]. However, attribute learning and the
exploitation of the additional information are highly coupled in
these approaches. It is very hard to add new additional
information to the model, or to unplug the additional-information
part from the original model when that information is not
available for the given data. In contrast, our proposed method
addresses this issue flexibly by considering the attribute
prediction task as a multi-graph (hypergraph) cut problem, which
allows any side information to be added as an extra graph or
hypergraph.

Zero-Shot Learning (ZSL) is the task of object recognition for
categories with no training examples [18]. Several intermediate
representations have been used for ZSL, including attributes
[18, 8, 17, 1], linguistic knowledge [9, 23], textual
descriptions [7] and visual abstraction [2]. The core of
attribute-based ZSL approaches is attribute learning. Therefore
our attribute prediction can be integrated into several ZSL
frameworks. Although our work is an attribute-based ZSL approach,
other intermediate representations can also be readily plugged
into the HAP framework to replace the attributes, since it
generally provides a mapping from the low-level representation to
an intermediate representation. Several ZSL approaches can be
extended to N-Shot Learning [34, 1, 9, 25].

2.2. Hypergraph Learning 

A hypergraph is a generalization of the regular graph that has
been widely applied to depict high-order relations between data
points [35, 5, 15, 29]. Since the attribute prediction task can
be regarded as a multi-label classification problem, we introduce
not only the relevant hypergraph model, but also some
hypergraph-based multi-label classification algorithms, which
motivated our work. More specifically, Zhou et al. [35] proposed
a normalized hypergraph model for embedding and transduction. Our
formulation is based on Zhou's model; however, it is inductive.
Moreover, we show how the hypergraph cut can be reformulated to
provide direct linear or nonlinear class predictors. Chen et al.
[5] leveraged the hypergraph to capture the correlation of
categories and introduced it as a regularization to the SVM model
for multi-label classification. Similar to [5], Sun et al. [29]
used the hypergraph to capture the correlation of classes and
performed a hypergraph embedding as the new representation for
multi-class classification. In [15] the hypergraph is utilized to
measure the loss of multi-labels during multi-kernel learning. In
contrast to all the existing hypergraph learning algorithms, as
far as we know, our model is the first approach that directly
derives the multi-label classifier from the hypergraph embedding.

3. Approach 
3.1. Preliminaries 

We start by reviewing some basic definitions of hypergraphs and
introducing the notation. Hypergraphs are a generalization of
graphs in which a hyperedge (the analogue of an edge) is an
arbitrary non-empty subset of the vertex set (Fig. 1 shows a
hypergraph with five hyperedges). Given a hypergraph
$G = (V, E)$ in an arbitrary feature space, $V$ and $E$ are the
vertex set and the hyperedge set, with vertices $v \in V$ and
hyperedges $e \in E$. Moreover, a hyperedge $e$ is a subset of
$V$. The vertex-edge incidence matrix
$H \in \mathbb{R}^{|V| \times |E|}$ is defined as follows

$$ h(v, e) = \begin{cases} 1, & \text{if } v \in e \\ 0, & \text{otherwise.} \end{cases} \tag{1} $$

The degree of a hyperedge $e$, denoted $\delta(e)$, is the number
of vertices in $e$

$$ \delta(e) = \sum_{v \in V} h(v, e), \tag{2} $$

and the degree of a vertex $v \in V$ is defined as follows

$$ d(v) = \sum_{v \in e,\, e \in E} w(e) = \sum_{e \in E} w(e)\, h(v, e), \tag{3} $$

where $w(e)$ is the weight of the hyperedge $e$. We denote the
diagonal matrix forms of $\delta(e)$, $d(v)$ and $w(e)$ as $D_e$,
$D_v$ and $W$ respectively. Note that we define $H$ as a binary
matrix for simplicity. For the continuous-valued case, a
probabilistic hypergraph model [12] can be adopted, in which each
element of $H$ denotes the probability of a vertex belonging to a
hyperedge.
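The degree definitions in Equations 2 and 3 reduce to a column sum and a matrix-vector product on the incidence matrix. The following is a minimal NumPy sketch of that computation; the function name `hypergraph_degrees` and the toy incidence matrix and weights are hypothetical and chosen only for illustration.

```python
import numpy as np

def hypergraph_degrees(H, w):
    """Return hyperedge degrees delta(e) (Eq. 2) and vertex degrees d(v) (Eq. 3)."""
    H = np.asarray(H, dtype=float)   # |V| x |E| incidence matrix with h(v, e) in {0, 1}
    w = np.asarray(w, dtype=float)   # one weight w(e) per hyperedge
    delta = H.sum(axis=0)            # delta(e) = sum_v h(v, e)
    d = H @ w                        # d(v) = sum_e w(e) h(v, e)
    return delta, d

# Hypothetical toy hypergraph: 4 vertices, 2 hyperedges.
H = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 1]])
w = np.array([0.5, 2.0])
delta, d = hypergraph_degrees(H, w)
D_e, D_v, W = np.diag(delta), np.diag(d), np.diag(w)   # diagonal matrix forms
```

In this toy case the second vertex belongs to both hyperedges, so its degree is the sum of both weights.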

3.2. Attribute Hypergraph 

In our model, we define a hypergraph to depict the attribute
relations of samples (corresponding to images in the training
set). In this hypergraph, the vertex $v_i \in V$ corresponds to
the sample $x_i \in X$, which is the $i$-th column of the
$d \times n$ sample matrix $X$. Here, $n$ is the number of samples
and $d$ is the dimension of the feature space. Each hyperedge is
defined as a vertex set that shares the same attribute label. In
this case, the number of hyperedges is equal to the number of
attributes $m$, and the $n \times m$ incidence matrix $H$ is
exactly the attribute label matrix. The more attributes a set of
images has in common, the more hyperedges exist between their
corresponding vertices, and the stronger the link between these
vertices. Breaking such a link therefore incurs a heavy penalty
during the learning process. In this way, the hypergraph provides
a natural way to capture the correlation/decorrelation of
attributes. We regard the hyperedge $e$ as a clique and take the
mean of the heat kernel weights of the pairwise edges in this
clique as the hyperedge weight


w(e) 





Other hyperedge weighting schemes can certainly be applied as
well.
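The weighting described above (the mean of the heat kernel weights over the pairwise edges of the clique formed by a hyperedge) can be sketched as follows. This is an illustration under stated assumptions, not the authors' exact formula: the kernel form exp(-||x_u - x_v||^2 / sigma), the bandwidth parameter `sigma`, the averaging over ordered pairs with u != v, and the function name `hyperedge_weights` are all assumptions introduced here.

```python
import numpy as np

def hyperedge_weights(X, H, sigma=1.0):
    """Mean heat-kernel weight over all vertex pairs inside each hyperedge.

    X: d x n sample matrix (columns are samples); H: n x m incidence matrix.
    ASSUMPTION: the kernel is exp(-||x_u - x_v||^2 / sigma) with a
    user-chosen bandwidth sigma; the source does not fix these details.
    """
    X = np.asarray(X, dtype=float)
    H = np.asarray(H)
    sq_norms = (X ** 2).sum(axis=0)
    dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.T @ X
    K = np.exp(-np.maximum(dist2, 0.0) / sigma)   # pairwise heat-kernel weights
    weights = np.zeros(H.shape[1])
    for j in range(H.shape[1]):
        idx = np.flatnonzero(H[:, j])             # vertices in hyperedge j
        if idx.size < 2:
            continue                              # no pairwise edge; weight stays 0
        sub = K[np.ix_(idx, idx)]
        n_pairs = idx.size * (idx.size - 1)       # ordered pairs with u != v
        weights[j] = (sub.sum() - np.trace(sub)) / n_pairs
    return weights
```

A hyperedge whose members are close together in feature space gets a weight near 1, while a hyperedge spanning distant samples gets a weight near 0.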


3.3. Hypergraph-based Attribute Predictor 

The normalized hypergraph cut is often utilized to learn
high-order relation and correlation information. The main idea of
our model stems from hypergraph-based transduction, which is
regarded as a regularized normalized hypergraph cut model [35].
In contrast, our method is a supervised inductive model. Since
our proposed attribute predictor (or classifier) is based on the
hypergraph model, we call it Hypergraph-based Attribute Predictor
(HAP).

In HAP, a collection of hypergraph cuts
$F = [f_1, \cdots, f_m]$ is defined as the predictors of
attributes in the feature space, where $m$ is the number of
attributes and the cut $f_i$ is a column vector whose elements are
the predictions of the $i$-th attribute for each sample. An
optimal cut should disrupt the hyperedges as little as possible
during hypergraph partitioning. In other words, the optimal cut
should keep the attribute relations of samples as much as
possible, since each hyperedge is given by an attribute label.
Similar to Zhou's normalized hypergraph [35], we define an
attribute relation loss function with respect to the given
hypergraph $G$ and a collection of hypergraph cuts $F$ as follows


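$$ \Omega(F, G) = \frac{1}{2} \sum_{e \in E} \sum_{(u, v) \in e} \frac{w(e)}{\delta(e)} \left\| \frac{F_u}{\sqrt{d(u)}} - \frac{F_v}{\sqrt{d(v)}} \right\|^2, \tag{5} $$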


where $F_u$ returns a row vector corresponding to the predictions
of attributes for the vertex $u$. Clearly, the loss is reduced
when the signs of $F_u$ and $F_v$ are identical. After some
derivation, Equation 5 can be reformulated as follows


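$$ \begin{aligned}
\Omega(F, G) &= \frac{1}{2} \sum_{e \in E} \sum_{(u, v) \in e} \frac{w(e)}{\delta(e)} \left\| \frac{F_u}{\sqrt{d(u)}} - \frac{F_v}{\sqrt{d(v)}} \right\|^2 \\
&= \mathrm{Tr}\left( F^T \left( I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} \right) F \right) \\
&= \mathrm{Tr}\left( F^T L_H F \right),
\end{aligned} \tag{6} $$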


where $L_H$ is the normalized hypergraph Laplacian matrix derived
from the hypergraph of attributes, and $I$ is an identity matrix.
$\mathrm{Tr}(\cdot)$ is the trace of a matrix. The detailed
derivation of Equation 6 can be found in the supplementary
material.
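The equivalence between the pairwise form of the attribute relation loss and its trace form can be checked numerically. The sketch below builds the normalized hypergraph Laplacian $L_H = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}$ and evaluates the pairwise sum directly, reading the inner sum as running over ordered vertex pairs within each hyperedge with a factor 1/2 (an assumption about the convention, following Zhou's normalized hypergraph). Function names are hypothetical, and the code assumes every vertex belongs to at least one hyperedge.

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """L_H = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}."""
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    delta = H.sum(axis=0)                 # hyperedge degrees
    d = H @ w                             # vertex degrees (assumed > 0)
    Hn = H / np.sqrt(d)[:, None]          # D_v^{-1/2} H
    theta = (Hn * (w / delta)) @ Hn.T     # D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}
    return np.eye(H.shape[0]) - theta

def pairwise_loss(F, H, w):
    """Direct pairwise evaluation over ordered pairs (u, v) in each hyperedge."""
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    delta, d = H.sum(axis=0), H @ w
    Fn = np.asarray(F, dtype=float) / np.sqrt(d)[:, None]   # rows F_u / sqrt(d(u))
    total = 0.0
    for j in range(H.shape[1]):
        idx = np.flatnonzero(H[:, j])
        diff = Fn[idx][:, None, :] - Fn[idx][None, :, :]
        total += w[j] / delta[j] * (diff ** 2).sum()
    return 0.5 * total
```

For any real-valued `F`, `np.trace(F.T @ hypergraph_laplacian(H, w) @ F)` and `pairwise_loss(F, H, w)` should agree up to floating-point error.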

Besides measuring the loss of attribute relation information, we
also need to consider the attribute prediction errors on the
training data, which can be obtained by calculating the Euclidean
distance between the attribute predictions $F$ and the attribute
label matrix. In order to make zero the classification boundary,
we define a shifted attribute label matrix $Y$ by shifting the
attribute labels as $Y = 2H - \mathbf{1}$, where $\mathbf{1}$ is a
matrix of the same size as $H$ whose elements are all equal to 1.
In $Y$, if an attribute exists in a sample, its corresponding
attribute label is 1, otherwise it is $-1$. Given this
definition, the attribute prediction loss is defined as

$$ \Lambda(F, Y) = \| F - Y \|^2. \tag{7} $$

Simultaneously minimizing the previous two losses leads to our
model

$$ \begin{aligned}
F &= \arg\min_F \{ \Omega(F, G) + \lambda \Lambda(F, Y) \} \\
&= \arg\min_F \{ \mathrm{Tr}(F^T L_H F) + \lambda \| F - Y \|^2 \},
\end{aligned} \tag{8} $$

where $\lambda$ is a positive parameter that balances these two
losses. The optimal hypergraph cut $f_i$ then introduces a binary
partition of the hypergraph that preserves the information of the
$i$-th attribute relation and reduces the prediction error of the
$i$-th attribute as much as possible.

From the perspective of graph embedding, the hypergraph cuts are
the embedding of the given hypergraph, where the embedding
coordinate of sample $u$ is the $u$-th row of $F$. Equation 8
thus aligns the hypergraph embedding space (defined by $F$) with
the shifted attribute space. Consequently, we transform the
problem of seeking attribute predictors/cuts into the problem of
finding a mapping from the feature space to this aligned embedding
space, i.e.

$$ F = X^T B, \tag{9} $$

where the projection matrix
$B = [\beta_1, \cdots, \beta_m] \in \mathbb{R}^{d \times m}$ is
such a collection of mappings, whose $i$-th column is a predictor
of the $i$-th attribute, corresponding to the $i$-th hypergraph
cut $f_i = X^T \beta_i$. We then substitute Equation 9 into
Equation 8 and introduce an $L_2$-norm constraint on $B$ to avoid
overfitting. Equation 8 is thus reformulated as the following
optimization problem with respect to $B$

$$ B = \arg\min_B \left( \mathrm{Tr}(B^T X L_H X^T B) + \lambda \| X^T B - Y \|^2 + \eta \| B \|^2 \right), \tag{10} $$

where $\eta$ is a positive regularization parameter. Since $L_H$
is a positive semi-definite matrix, this is a typical Regularized
Least Squares (RLS) problem that can be solved efficiently. Taking
the partial derivative of the objective in Equation 10 with
respect to $B$ and setting it to zero leads to a closed-form
solution for $B$:

$$ \begin{aligned}
& X L_H X^T B + \lambda (X X^T B - X Y) + \eta B = 0 \\
\Rightarrow\ & B = (X L_H X^T + \lambda X X^T + \eta I)^{-1} (\lambda X Y) \\
\Rightarrow\ & B = (X L_H X^T + \lambda X X^T + \eta I)^{-1} (\lambda X (2H - \mathbf{1})).
\end{aligned} \tag{11} $$

At test time, given an unlabeled sample $z_i$, its attribute
predictions are obtained by projecting the sample into the
subspace spanned by $B$,

$$ p_i = \mathrm{sign}(z_i^T B), \tag{12} $$

where $\mathrm{sign}(\cdot)$ returns the sign of each element of a
vector and $p_i = [p_{i1}, \cdots, p_{ij}, \cdots, p_{im}]$ is a
row vector encoding the predicted attributes. The projection
$z_i^T \beta_j$ underlying its $j$-th element $p_{ij}$ is the
confidence of the existence of the $j$-th attribute for the sample
$z_i$. We call the subspace spanned by $B$ the Attribute
Prediction Space (APS), since each basis of this space is a
predictor of a specific attribute.
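The closed-form solution and the test-time projection described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the random toy features, the toy attribute matrix, and the use of unit hyperedge weights in place of the heat-kernel weights are all assumptions made for the example.

```python
import numpy as np

def normalized_laplacian(H, w):
    """L_H = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} (every vertex needs >= 1 hyperedge)."""
    H = np.asarray(H, dtype=float)
    d, delta = H @ w, H.sum(axis=0)
    Hn = H / np.sqrt(d)[:, None]
    return np.eye(H.shape[0]) - (Hn * (w / delta)) @ Hn.T

def fit_hap(X, H, L_H, lam=1.0, eta=1.0):
    """B = (X L_H X^T + lam X X^T + eta I)^{-1} (lam X Y) with Y = 2H - 1."""
    X = np.asarray(X, dtype=float)                  # d x n, columns are samples
    Y = 2.0 * np.asarray(H, dtype=float) - 1.0      # shifted labels in {-1, +1}
    A = X @ L_H @ X.T + lam * (X @ X.T) + eta * np.eye(X.shape[0])
    return np.linalg.solve(A, lam * (X @ Y))        # d x m projection matrix

def predict_attributes(Z, B):
    """Signs and raw confidences for test samples stored as columns of Z."""
    scores = np.asarray(Z, dtype=float).T @ B
    return np.sign(scores), scores

# Hypothetical toy data: 8 samples, 5 features, 3 attributes.
rng = np.random.default_rng(0)
H = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0],
              [0, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 1]])
X = rng.standard_normal((5, 8))
L_H = normalized_laplacian(H, np.ones(3))           # unit weights: an assumption
B = fit_hap(X, H, L_H, lam=1.0, eta=0.1)
P, S = predict_attributes(X, B)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse; the system matrix is only d x d, which is why prediction and training cost depend on the feature dimension rather than on the number of attributes.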


3.4. Incorporating Class and Side Information 

As a regularized graph learning approach, HAP can flexibly
incorporate other meaningful constraints to further enhance
attribute learning. In this section, we take the class label as
an example to show how to leverage additional information to
enhance our model. Exploiting class labels can enhance the
classification abilities of HAP algorithms, since homogeneous
samples (samples of the same class) always share more
similarities in attributes.

We adopt two approaches to incorporate the class information. The
first approach uses a hypergraph $G_c = (V, E_c)$ to depict the
class relations of samples, similar to the way we used a
hypergraph to depict the attribute relations in the previous
subsection. The hypergraph Laplacian $L_c$ is readily derived
from this hypergraph in the same way as in Equation 6. The second
approach, following [3, 11], constructs a pairwise graph
$G_l = (V, E_l)$ in a supervised way to encode the class
information. We can encode class information using a graph
because, unlike the attributes, the classes are disjoint. Two
samples are connected with an edge if they belong to the same
class (homogeneous samples). Similar to the hypergraph model,
heat kernel weighting is adopted as the edge weighting scheme.
Finally, the graph Laplacian $L_l$ is easily derived with the
well-known Laplacian Eigenmapping model.

Introducing such a class-label graph $G_l$ or hypergraph $G_c$
into the loss function in Equation 6 leads to the new loss
function

$$ \begin{aligned}
\Omega(B, G, G_*) &= \mathrm{Tr}(F^T L_H F + \gamma F^T L_* F) \\
&= \mathrm{Tr}(B^T X (L_H + \gamma L_*) X^T B) \\
&= \mathrm{Tr}(B^T X L_W X^T B),
\end{aligned} \tag{13} $$

where $G_*$ denotes either $G_l$ or $G_c$ and $L_*$ is the
corresponding Laplacian matrix. $L_W = L_H + \gamma L_*$ is the
combination of the original and new Laplacian matrices. According
to Equation 10, the new objective function can be reformulated as
follows

$$ B = \arg\min_B \left( \mathrm{Tr}(B^T X L_W X^T B) + \lambda \| X^T B - Y \|^2 + \eta \| B \|^2 \right), \tag{14} $$

and the solution for $B$ is obtained by simply replacing $L_H$
with $L_W$ in Equation 11

$$ B = (X L_W X^T + \lambda X X^T + \eta I)^{-1} (\lambda X (2H - \mathbf{1})). \tag{15} $$

We call these models Class-Specific Hypergraph-based Attribute
Predictors (CSHAP). To distinguish the Hypergraph-based CSHAP
from the Graph-based (Laplacian Eigenmapping-based) CSHAP, we
denote them by CSHAP_H and CSHAP_G respectively. We expect
CSHAP_G to capture the intra-manifold structure between the
samples better than CSHAP_H, since CSHAP_H just groups the
homogeneous samples using hyperedges, while CSHAP_G preserves the
pairwise structure.


If more additional information is available, the graph Laplacians
that encode this information can also be added,
$L_W = L_H + \gamma_1 L_1 + \cdots + \gamma_\alpha L_\alpha$. The
positive regularization parameters $\gamma_i$,
$i \in \{1, \cdots, \alpha\}$, can be determined by multiple
kernel learning, since each Laplacian matrix is associated with
an affinity matrix (similarity matrix) which can be considered as
a kernel matrix.
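To make the multi-graph combination concrete, the sketch below builds a pairwise class graph with heat-kernel weights and adds its Laplacian to an existing attribute Laplacian with a weight gamma. Several details are assumptions made for illustration and are not fixed by the text: the unnormalized Laplacian D - A is used for the class graph, the kernel is exp(-||x_u - x_v||^2 / sigma) with a hypothetical bandwidth `sigma`, and the function names are invented here.

```python
import numpy as np

def class_graph_laplacian(X, labels, sigma=1.0):
    """Unnormalized Laplacian D - A of a graph linking same-class samples.

    X: d x n sample matrix; labels: length-n class ids.
    ASSUMPTIONS: heat-kernel edge weights exp(-||x_u - x_v||^2 / sigma)
    and the unnormalized Laplacian; a normalized variant is equally possible.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    sq = (X ** 2).sum(axis=0)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    same = labels[:, None] == labels[None, :]        # connect homogeneous samples only
    A = np.where(same, np.exp(-dist2 / sigma), 0.0)
    np.fill_diagonal(A, 0.0)                         # no self-loops
    return np.diag(A.sum(axis=1)) - A

def combine_laplacians(L_H, extra_laplacians, gammas):
    """L_W = L_H + sum_i gamma_i * L_i for any number of side-information graphs."""
    L_W = np.array(L_H, dtype=float)                 # copy, so L_H is left untouched
    for L_i, gamma_i in zip(extra_laplacians, gammas):
        L_W += gamma_i * L_i
    return L_W
```

The resulting `L_W` can then be passed wherever the attribute Laplacian was used in the closed-form solution, which is what makes side information pluggable in this formulation.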

3.5. Kernelization of HAP 

The mapping from the feature space to the shifted attribute space
(aligned embedding space) may not be linear. This motivated us to
present a kernelization of our method. According to the
generalized representer theorem [24], a minimizer of a
regularized empirical risk function over an RKHS can be
represented as a linear combination of kernels evaluated on the
training set. Inspired by the representer theorem applied to the
attribute classification risk function, we embed the kernel
representation of the samples (i.e. $K_X B$). This transformation
can be interpreted as follows: each dimension of the embedding is
a linear combination of the kernel evaluations on the training
set, which matches the representer theorem,

$$ F = K_X B, \tag{16} $$

where $K_X$ is an $n \times n$ kernel matrix associated with a
kernel function $k(\cdot, \cdot)$, and $n$ is the number of points
in the training set. Therefore, the objective functions of the
Kernelized HAP (KHAP) and Kernelized CSHAP (KCSHAP) can be written
as follows

$$ B = \arg\min_B \left\{ \mathrm{Tr}(B^T K_X L_a K_X B) + \lambda \| K_X B - Y \|^2 + \eta \| B \|^2 \right\}, $$

where $L_a$ is equal to $L_H$ in the KHAP case and $L_W$ in the
KCSHAP case. The solution of this problem is

$$ B = (K_X L_a K_X + \lambda K_X K_X + \eta I)^{-1} (\lambda K_X Y). \tag{17} $$

Then, the attribute predictions can be obtained as follows

$$ p(z_*) = \mathrm{sign}\left( k(z_*)^T B \right) = \mathrm{sign}\left( \sum_{i=1}^{n} k(z_*, x_i)\, B_i \right), \tag{18} $$

where $k(z_*) = [k(z_*, x_1), \cdots, k(z_*, x_n)]^T$, $B_i$
denotes the $i$-th row of $B$, and $z_*$ is the test sample.
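The kernelized variant follows the same pattern as the linear one, with the kernel matrix taking the place of the transposed sample matrix. The sketch below is illustrative only: the RBF kernel and its `gamma` parameter are an assumed kernel choice (the text does not prescribe one), and the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) between columns of A and columns of B.

    ASSUMPTION: an RBF kernel is used here purely as an example kernel.
    """
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    d2 = (A ** 2).sum(axis=0)[:, None] + (B ** 2).sum(axis=0)[None, :] - 2.0 * A.T @ B
    return np.exp(-gamma * np.maximum(d2, 0.0))

def fit_khap(K, Y, L_a, lam=1.0, eta=1.0):
    """B = (K L_a K + lam K K + eta I)^{-1} (lam K Y); L_a is L_H (KHAP) or L_W (KCSHAP)."""
    n = K.shape[0]
    A = K @ L_a @ K + lam * (K @ K) + eta * np.eye(n)
    return np.linalg.solve(A, lam * (K @ Y))         # n x m coefficient matrix

def predict_khap(K_test, B):
    """Each row of K_test holds [k(z, x_1), ..., k(z, x_n)] for one test sample z."""
    return np.sign(K_test @ B)
```

A typical call would compute `K = rbf_kernel(X, X)` on the training columns, fit `B`, and then predict with `rbf_kernel(Z, X)` for test samples stored as columns of `Z`.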

3.6. Zero-Shot and N-Shot Learning 

Typically there are two ways to annotate the samples using
attributes: either assigning the attributes to each sample, or
assigning the attributes to each class. Our proposed approach
supports both scenarios.

At zero-shot or N-shot time, before we classify samples based on
the predicted attributes, we use the sigmoid function to normalize
the obtained attribute confidences $s_i = z_i^T B$ into the range
$[0, 1]$,


$$ r_i = \frac{1}{1 + \exp(-\rho s_i)}, \tag{19} $$


where $\rho$ is a positive scaling parameter and
$r_i = [r_{i1}, \cdots, r_{im}]$ is the normalized attribute
confidence vector, which can be regarded as the probabilities of
the existence of the attributes.

In the case where only the classes are labeled with attributes,
we follow the approach of Direct Attribute Prediction (DAP)
[18, 17], where Bayes' rule is adopted to calculate the posterior
of a test class for a given sample based on its attribute
probabilities $r$. The sample is labeled with the class with the
maximum posterior.

For the case where each sample is annotated with attributes, we
define the mean of the attribute prototypes in the same class as
the attribute template of this class. We denote the template of
class $j$ as $t_j$. The elements of this template indicate the
prior probabilities of the attributes with respect to this class.
We classify the samples by directly measuring the Euclidean
distance between the attribute existence probabilities of the
sample and the attribute template of a class

$$ \ell(z_i) = \arg\min_j \| r_i - t_j \|^2, \tag{20} $$

where $\ell(\cdot)$ returns the class label of a sample.
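The template-based classification for the sample-annotated case can be sketched as follows. The sigmoid is written with the scaling parameter multiplying the confidence, which is an assumption about how the scaling enters; the function names and the tiny example values in the usage note are likewise hypothetical.

```python
import numpy as np

def attribute_probabilities(scores, rho=1.0):
    """Sigmoid normalization of confidences into [0, 1].

    ASSUMPTION: the scaling parameter rho multiplies the confidence,
    i.e. r = 1 / (1 + exp(-rho * s)).
    """
    return 1.0 / (1.0 + np.exp(-rho * np.asarray(scores, dtype=float)))

def class_templates(attr_labels, class_ids):
    """Mean binary attribute vector per class; returns (sorted class ids, templates)."""
    attr_labels = np.asarray(attr_labels, dtype=float)
    class_ids = np.asarray(class_ids)
    classes = np.unique(class_ids)
    T = np.stack([attr_labels[class_ids == c].mean(axis=0) for c in classes])
    return classes, T

def classify(R, classes, T):
    """Assign each row of R to the class whose template is nearest in squared Euclidean distance."""
    R = np.asarray(R, dtype=float)
    d2 = ((R[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]
```

For instance, with templates built from `class_templates([[1, 0], [1, 0], [0, 1]], [7, 7, 9])`, a sample whose confidences are strongly positive on the first attribute and negative on the second is assigned to class 7.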

4. Experiments 

4.1. Experimental Setups 

Datasets: We use three datasets to validate the proposed
approach: Animal With Attributes (AWA) [18], Caltech-UCSD Birds
(CUB) [30] and Unstructured Social Activity Attribute (USAA)
[10]. AWA contains 30,475 images of 50 animal classes. Each class
is annotated with 85 attributes. Following [18, 17], we divide
the dataset into 40 classes (24,295 images) for training and 10
classes (6,180 images) for testing. CUB (2011 version) [30]
contains roughly 11,800 images of 200 bird classes. Each class is
annotated with 312 binary attributes. We split the dataset
following [1] to facilitate direct comparison (150 classes for
training and the remaining 50 classes for testing). USAA is a
video dataset [10] with 69 instance-level attributes for 8
classes of complex social group activity videos from YouTube.
Each class has around 100 training videos and 100 testing videos.
We follow [10] in splitting the dataset by randomly dividing the
8 classes into two disjoint sets of four classes each for
training and testing (the mean accuracies are reported).

Features: We adopt the 4096-dimensional deep learning features
named DeCAF [6] as the baseline features for the AWA dataset,
since these features are already available online for
comparison^1. We extract 4096-dimensional deep learning features
with Caffe [16] to represent the images in the CUB database. The
USAA database already provides 14,000-dimensional baseline
features^2, which are constructed from six histogram features,
namely RGB color histograms, SIFT, rgSIFT, PHOG, SURF and local
self-similarity histograms [10].

Table 1. Average Attribute Prediction Accuracies (in AUC).

Approaches    Prediction Accuracies (%)
              AWA           USAA        CUB
HAP           74.0          61.7±1.3    68.5
CSHAP_H       74.0          62.2±0.8    68.7
CSHAP_G       74.3          61.8±1.8    68.5
DAP [17]      72.8/63.0*    -           61.8
IAP [17]      72.1/73.8*    -           -
ALE [1]       65.7          -           60.3

Metrics: We report the classification accuracy (in %) averaged
over the classes as the N-shot learning and ZSL accuracy on the
AWA and CUB databases. On the USAA database, we follow [10, 9]
and report the absolute classification accuracy of the data. For
attribute prediction accuracies, we report the average Area Under
the Curve (AUC) of the ROC.
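As an illustration of the attribute metric, the average AUC can be computed per attribute and then averaged. The sketch below uses the pairwise-ranking definition of the ROC AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as one half); this is a generic formulation and not necessarily the evaluation script used for the reported numbers. The function names are hypothetical.

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC AUC as the probability that a positive sample outscores a negative one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        return np.nan                                 # AUC undefined with one class only
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)

def mean_attribute_auc(S, Y):
    """Average AUC over attributes; S and Y are n_samples x n_attributes."""
    S, Y = np.asarray(S, dtype=float), np.asarray(Y)
    aucs = [roc_auc(S[:, j], Y[:, j]) for j in range(S.shape[1])]
    return float(np.nanmean(aucs))                    # skip attributes with undefined AUC
```

The pairwise form costs O(n_pos * n_neg) per attribute, which is fine for a sketch; a rank-based implementation would be preferable for large test sets.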

4.2. Attribute Prediction 

We report the attribute prediction performance of different
approaches in Table 1. Three well-known attribute learning
approaches, namely Direct Attribute Prediction (DAP) [18, 17],
Indirect Attribute Prediction (IAP) [18, 17] and Attribute Label
Embedding (ALE) [1], are reported for comparison. The sign *
indicates the performance obtained by running the code provided
by the authors on the DeCAF features we are using, which are also
provided by the authors^3. From the results, we find that HAP,
CSHAP_H and CSHAP_G outperform all the compared approaches. For
example, the accuracy gains of HAP, CSHAP_H and CSHAP_G over DAP
on AWA are 11%, 11% and 11.3% respectively under the same
features and experimental settings. CSHAP_G performed slightly
better than the other two HAP algorithms. This is not surprising,
since the contribution of the class label in CSHAP_H and CSHAP_G
is expected to be limited for attribute prediction.

4.3. Zero-Shot Learning 

The results of seven recent ZSL approaches are included 
for comparison in Table 2. These are Attribute 
Hierarchical Label Embedding (AHLE) [1], Hierarchies 
Label Embedding (HLE) [1], Multi-modal Latent Attribute 
Topic Model (M2LATM) [10], Propagated Semantic 
Transfer (PST) [22], Zero-Shot Random Forests (ZSRF) [13], 
Category-Level Attribute approach (CLA) [33] and 
Decorrelated Attributes (DA) [14]. For [13, 33] we only compare 
with their results on the attributes provided by the dataset, 
and not their results using discovered attributes. As before, 
the sign * indicates the performance of running the code 

^1 http://www.ist.ac.at/~chl/AwA/AwA-features-decaf.tar.bz2 

^2 http://www.eecs.qmul.ac.uk/~yf300/USAA/download/ 

^3 We use the code and features available on the AWA webpage. The 
parameters of the model are tuned using cross validation to get the 
best performance. 


Table 2. Zero-shot Learning Accuracies. 

Approach        Classification Accuracies (%) 
                AWA          USAA        CUB 
HAP             45.0         44.1±3.6    17.5 
CSHAP_H         45.6         45.3±4.2    17.5 
CSHAP_G         45.0         44.6±3.7    17.5 
DAP [17]        41.4/42.8*   35.2        10.5 
IAP [17]        42.2/35.7*   -           - 
ALE [1]         37.4         -           18.0 
HLE [1]         39.0         -           12.1 
AHLE [1]        43.5         -           17.0 
M2LATM [10]     41.3         41.9        - 
PST [22]        42.7         36.2        - 
ZSRF [13]       43.0         -           - 
CLA [33]        42.3         -           - 
DA [14]         30.6         -           - 


provided by the AWA authors on the same features we are 
using, which are also provided by the authors. 

From the results in Table 2, we notice that the proposed 
approaches outperform the compared methods on the 
AWA and USAA datasets. For example, the accuracy gains 
of CSHAP_H over DAP and AHLE are 2.8% and 2.1% on 
the AWA dataset. On the USAA dataset, the performance 
improvements of the HAP algorithms over other approaches are 
more significant: CSHAP_H is 10.1% and 3.0% more 
accurate than DAP and M2LATM respectively. Although 
the HAP algorithms do not obtain the best performance 
on the CUB dataset, where ALE is best, they still 
outperform the other approaches. Moreover, the performance 
gap between the HAP algorithms and ALE is only 0.5% on 
this dataset. 

In the experiments, CSHAP_H often achieves better 
results than HAP, since CSHAP_H leverages the 
class labels to cluster homogeneous samples together 
in the attribute prediction step. A similar phenomenon can 
also be observed for CSHAP_G, although its improvement is less 
significant. We attribute this to the mechanism of CSHAP_G, 
which tries to preserve the manifold structure of each class 
using the given class labels. In the ZSL case, the test classes are 
unseen in the training dataset. Therefore, CSHAP_G may not 
capture the manifold structures of unseen classes. Another 
interesting phenomenon is that CSHAP can enhance HAP 
on the AWA and USAA databases, but not on 
the CUB database. This is because the CUB dataset has 
more attributes, which are already sufficient for distinguishing 
classes, while the other two datasets have fewer attributes and 
therefore benefit more from the complementary information 
(the numbers of attributes of the USAA, AWA and CUB 
datasets are 69, 85 and 312 respectively). 

4.4. N-Shot Learning 

We extend ZSL into N-Shot Learning (NSL) where a few 
(N) samples of the test classes are added to the training 
dataset. The experiments are conducted on the USAA and 
CUB datasets. Figure 3 shows the trend of the three 
approaches as N increases. In these experiments, we find that 












Figure 3. N-Shot Learning accuracies on two datasets: (a) CUB, (b) USAA. 


all three methods achieve similar performance when N is 
small. Interestingly, as N increases, CSHAP_G 
significantly outperforms the others. This agrees with our 
hypothesis about the way CSHAP_G captures the intra-class 
manifold structure while incorporating the class information, in 
contrast to just grouping homogeneous samples together 
as CSHAP_H does. In this case, the addition of 
samples of the test classes improves the quality of the captured 
intra-class manifold structures, and therefore improves the 
performance of CSHAP_G. 

Note that the performances of the HAP methods on the USAA 
dataset are much better than the ones reported in Section 4.3, 
since only half of the testing samples and training samples 
are used. A similar strategy is used on the CUB dataset. 
Compared to the previous experiment, N samples per test 
class have to be taken out for the NSL training. 
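
The N-shot protocol described above, in which N samples per test class are taken out and added to the training data, could be sketched as follows. The function name, the dictionary layout and the use of a seeded random draw are assumptions for illustration; how the N samples were actually chosen is not specified here.

```python
import random
from typing import Dict, List, Tuple


def n_shot_split(
    test_items: Dict[str, List[int]],
    n_shot: int,
    seed: int = 0,
) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Move `n_shot` sample ids per test class into the training side.

    `test_items` maps each test class to its sample ids (hypothetical layout).
    Returns (moved_to_training, remaining_for_testing).
    """
    rng = random.Random(seed)  # seeded so the split is reproducible
    moved: Dict[str, List[int]] = {}
    remaining: Dict[str, List[int]] = {}
    for cls, ids in test_items.items():
        if n_shot > len(ids):
            raise ValueError(f"class {cls!r} has fewer than {n_shot} samples")
        picked = set(rng.sample(ids, n_shot))
        moved[cls] = [i for i in ids if i in picked]
        remaining[cls] = [i for i in ids if i not in picked]
    return moved, remaining
```

With `n_shot = 0` this reduces to the zero-shot setting, where no sample of a test class is seen during training.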

4.5. Full Data Categorization 

We use the attributes as the feature representation 
to tackle the common categorization task. The default 
splits of data defined in the datasets are employed. 
Table 3 reports the results. Two classification methods are 
employed. The first one is the one 
defined in Equation 20. The second one is the simple 
Euclidean distance-based Nearest Neighbor Classifier (NNC). 
The sign † indicates the results using NNC. Similar to the 
NSL results, CSHAP_G achieves the best performance since it 
better integrates the class information. 
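
The second classifier, NNC on predicted attribute vectors, is simple enough to sketch directly. This is an illustrative plain-Python version under assumed names and data layout; the classifier of Equation 20 is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two attribute vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor_classify(
    query: Sequence[float],
    train: List[Tuple[Sequence[float], str]],
) -> str:
    """Return the label of the training attribute vector closest to `query`."""
    if not train:
        raise ValueError("training set is empty")
    best_label, best_dist = train[0][1], math.inf
    for vec, label in train:
        d = euclidean(query, vec)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```

Here each training item pairs a predicted attribute vector with its class label, so the attribute predictions act as the feature representation.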


Table 3. Full Data Categorization Accuracy (%). 

Database    Classification Accuracies (%) 
            HAP          CSHAP_H      CSHAP_G      AHLE 
CUB         23.1/21.4†   23.1/21.2†   26.1/24.5†   23.5 
USAA        50.1/48.1†   50.1/48.1†   50.9/49.2†   - 


4.6. Evaluations of Kernel HAP Algorithms 

We also conduct several experiments to test the potential 
of the kernel HAP algorithms in Zero-Shot Learning. 
The Gaussian kernel and the Cauchy kernel are applied. 
Figure 4 shows the ZSL performances of these kernel HAP 
algorithms on the USAA and CUB datasets. On the USAA dataset, 
the Gaussian kernel improves the ZSL accuracies 
of HAP, CSHAP_H and CSHAP_G from 44.1%, 45.3% and 
44.6% to 46.3%, 48.2% and 46.7%. The ZSL accuracies 
of the three Cauchy kernel-based HAP algorithms are 46.1%, 



Figure 4. The performances of Kernel HAP algorithms in USAA 
and CUB datasets: (a) CUB, (b) USAA. 


48.3% and 47.0%. On the CUB database, these two kernels 
actually reduce the ZSL accuracies of the HAP algorithms. The 
accuracies of the kernel HAP algorithms are around 15.5% 
to 16.5%. We attribute this to the fact that the deep features 
used for the CUB dataset are originally designed to be linear. 
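
For reference, the two kernels named above can be written as below. The parameterization (a single bandwidth `sigma`, with the Gaussian kernel using `2 * sigma**2` in the denominator) is one common convention and an assumption here; the bandwidths used in the experiments are not stated in this section.

```python
import math
from typing import Callable, List, Sequence


def squared_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def gaussian_kernel(x: Sequence[float], y: Sequence[float], sigma: float = 1.0) -> float:
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    return math.exp(-squared_distance(x, y) / (2.0 * sigma ** 2))


def cauchy_kernel(x: Sequence[float], y: Sequence[float], sigma: float = 1.0) -> float:
    """k(x, y) = 1 / (1 + ||x - y||^2 / sigma^2)."""
    return 1.0 / (1.0 + squared_distance(x, y) / sigma ** 2)


def gram_matrix(
    samples: List[Sequence[float]],
    kernel: Callable[[Sequence[float], Sequence[float]], float],
) -> List[List[float]]:
    """Kernel (Gram) matrix K with K[i][j] = kernel(samples[i], samples[j])."""
    return [[kernel(a, b) for b in samples] for a in samples]
```

The Cauchy kernel decays polynomially rather than exponentially with distance, so distant samples keep more influence than under the Gaussian kernel.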

4.7. The Parameters and Computational Cost 

The experimental results for the choices of the parameters 
are reported in the supplementary material. Since HAP is 
a graph-based algorithm and involves a matrix inversion, its 
computational complexity for training is the minimum of 
$O(nmd)$ and $O(d^3)$, and its computational complexity for 
testing is $O(d)$. So, it is more time consuming for training 
but quite efficient for testing. Taking the CUB dataset as 
an example (5994 samples for training and 5794 samples 
for testing), the time for training 312 attribute predictors is 
23.33 seconds. The time for predicting the attributes of all 
test samples is 0.14 seconds. The code is written in MATLAB 
and the experimental hardware configuration is a quad-core 
2.5 GHz CPU with 8 GB of RAM. 

5. Conclusion 

We presented a novel attribute prediction approach 
called Hypergraph-based Attribute Predictor (HAP), which 
derives a collection of attribute classifiers from the 
hypergraph embedding, in which the attribute relations are 
considered as hyperedges and the hypergraph cuts are the 
attribute predictions. The hypergraph formulation facilitates 
exploiting the correlations of the attributes as well as jointly 
learning the attribute predictors. Moreover, additional 
information can be flexibly incorporated into HAP by 
encoding it in a penalty graph or hypergraph. To 
generalize the mappings between the feature space and the 
attribute space, which are known as the attribute predictors, we 
also kernelized the model. Extensive experiments on three 
well-known attribute datasets demonstrated the effectiveness 
of our model for attribute prediction, Zero-Shot Learning, 
N-Shot Learning and categorization. From the results 
we conclude that the CSHAP_G variant is the best at 
integrating class labels for N-shot learning, whereas the three 
proposed variants perform similarly in the zero-shot learning 
task. 

The Supplementary Materials 

The Derivation Details of Attribute Relation Loss Function 

In this section, we give the detailed derivation of Equation (6). Before that, let us review some related notation. 
The attribute relations are encoded in a given hypergraph $G = (V, E)$, where $V$ is the vertex set and $E$ denotes the hyperedge 
set. In this hypergraph, each vertex corresponds to an instance and each hyperedge is associated with an attribute 
relation. The degree of a hyperedge $e \in E$, denoted $\delta(e)$, is the number of vertices in $e$. The $(v, e)$-th element 
of the vertex-edge incidence matrix $H \in \mathbb{R}^{|V| \times |E|}$ is $h(v, e) = 1$ if $v \in e$ and $h(v, e) = 0$ otherwise, where $v \in V$. 
$d(v) = \sum_{e \in E} w(e) h(v, e)$ denotes the degree of the vertex $v$, where $w(e)$ is the weight of the hyperedge $e$. We 
denote the diagonal matrix forms of $\delta(e)$, $d(v)$ and $w(e)$ as $D_e$, $D_v$ and $W$ respectively. We obtain the attribute predictions 
by defining a collection of hypergraph cuts, denoted $F$. $F_u$ is the row vector of $F$ corresponding to 
the attribute predictions for the vertex $u$. $L_H$ is the normalized hypergraph Laplacian matrix derived from the 
hypergraph of attributes, and $I$ is an identity matrix. $\mathrm{Tr}(\cdot)$ is the trace of a matrix. 
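
To make the notation concrete, the following sketch builds the normalized hypergraph Laplacian from an incidence matrix and hyperedge weights, using the standard form $L_H = I - D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2}$. That this standard form matches the paper's $L_H$ exactly is an assumption, as are the function names; the code uses plain Python lists for clarity rather than speed.

```python
import math
from typing import List

Matrix = List[List[float]]


def hypergraph_laplacian(H: Matrix, w: List[float]) -> Matrix:
    """Normalized hypergraph Laplacian I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}.

    H[v][e] is 1 if vertex v belongs to hyperedge e and 0 otherwise;
    w[e] is the weight of hyperedge e.
    """
    n, m = len(H), len(w)
    # delta(e): number of vertices in hyperedge e
    delta = [sum(H[v][e] for v in range(n)) for e in range(m)]
    # d(v): weighted number of hyperedges containing vertex v
    d = [sum(w[e] * H[v][e] for e in range(m)) for v in range(n)]
    if min(delta) <= 0 or min(d) <= 0:
        raise ValueError("every hyperedge and every vertex needs positive degree")
    L = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            theta = sum(w[e] * H[u][e] * H[v][e] / delta[e] for e in range(m))
            theta /= math.sqrt(d[u] * d[v])
            L[u][v] = (1.0 if u == v else 0.0) - theta
    return L


def cut_cost(F: Matrix, L: Matrix) -> float:
    """Tr(F^T L F), where F has one row of predictions per vertex."""
    n, k = len(F), len(F[0])
    return sum(
        F[u][j] * L[u][v] * F[v][j]
        for u in range(n) for v in range(n) for j in range(k)
    )
```

A quick sanity check of this form is that the vector with entries $\sqrt{d(v)}$ has zero cost, since it assigns every vertex the same normalized value.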

Now we can present the detailed derivation of Equation (6) as follows: 

(21) 
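
The body of Equation (21) is not legible in this text. As a hedged reference, the standard derivation for a normalized hypergraph cut in the notation above runs as follows. The symbol $\Omega(F)$ and the leading factor $\frac{1}{2}$ are assumptions; the paper's Equation (6) may differ by a constant factor.

```latex
\[
\begin{aligned}
\Omega(F)
&= \frac{1}{2}\sum_{e \in E}\sum_{u, v \in V}
   \frac{w(e)\,h(u,e)\,h(v,e)}{\delta(e)}
   \left\| \frac{F_u}{\sqrt{d(u)}} - \frac{F_v}{\sqrt{d(v)}} \right\|^2 \\
&= \sum_{e \in E}\sum_{u, v \in V}
   \frac{w(e)\,h(u,e)\,h(v,e)}{\delta(e)}
   \left( \frac{\|F_u\|^2}{d(u)}
        - \frac{F_u F_v^{\top}}{\sqrt{d(u)\,d(v)}} \right) \\
&= \sum_{u \in V} \|F_u\|^2
   - \sum_{u, v \in V}\sum_{e \in E}
     \frac{F_u\,w(e)\,h(u,e)\,h(v,e)\,F_v^{\top}}
          {\delta(e)\sqrt{d(u)\,d(v)}} \\
&= \mathrm{Tr}\!\left(F^{\top}\left(I - D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2}\right)F\right)
 = \mathrm{Tr}\!\left(F^{\top} L_H F\right).
\end{aligned}
\]
```

The second line uses the symmetry of the double sum in $u$ and $v$; the third uses $\sum_{v \in V} h(v,e) = \delta(e)$ and $\sum_{e \in E} w(e)\,h(u,e) = d(u)$.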


The Influences of Parameters 

There are several parameters in the HAP models: $\mu$, $\lambda$, $\eta$ and $\gamma$. $\mu$ controls the degree of hyperedge 
weighting. $\lambda$ controls the trade-off between the attribute relation loss and the attribute prediction error. $\eta$ is 
employed to avoid overfitting. $\gamma$ controls the degree of penalty of the side information loss. The 
ultimate goals of different attribute learning-based systems differ. Some systems aim at annotation or retrieval 
and pay more attention to improving the attribute prediction accuracy. Other systems focus 
on categorization, i.e., Zero-Shot Learning, N-Shot Learning and attribute-based categorization, and pay more 
attention to exploiting the discriminating power of attributes. In our approach, the parameters can be tuned 
according to which evaluation metric we care about more. Therefore, we discuss the choices of parameters separately for these two 
cases. 

The Influences of Parameters to The Attribute Prediction 

HAP has three parameters, $\mu$, $\lambda$ and $\eta$, that need to be tuned. The CSHAP algorithms have one more parameter, $\gamma$, that needs to be tuned. 
In the parameter selection procedure, we choose one parameter to tune and fix the values of the other parameters. The initial 
values of $\mu$, $\lambda$, $\eta$ and $\gamma$ are equal to 0.1. The parameter selection procedure starts from $\mu$ and proceeds to $\gamma$. Once the optimal value 
of a parameter is learned, its initial value is replaced by that optimal value, so that the optimal values of 
the remaining parameters are estimated more accurately. Figures 5, 6 and 7 respectively report the attribute prediction accuracies using 

























Figure 5. The influence of $\mu$ on the attribute prediction accuracies: (a) AWA, (b) USAA, (c) CUB. 

Figure 6. The influence of $\lambda$ on the attribute prediction accuracies. 

Figure 7. The influence of $\eta$ on the attribute prediction accuracies: (a) AWA, (b) USAA, (c) CUB. 


different $\mu$, $\lambda$ and $\eta$ on different databases. From the observations, all HAP algorithms achieve their best performance on all 
three databases when $\mu = 1$. The best choices of $\lambda$ for the AWA, USAA and CUB databases are 10, 0.1 and 1 respectively, and 
the corresponding values of $\eta$ are 10, 0.1 and 10. Figure 8 plots the relationship between $\gamma$ and the attribute prediction accuracy on 
the three databases. We find that CSHAP_G is more sensitive to $\gamma$. It is not hard to conclude from the observations 
that the optimal values of $\gamma$ are 1, 0.01 and 10 for the AWA, USAA and CUB databases respectively. 
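
The one-parameter-at-a-time selection procedure described above amounts to a coordinate-wise grid search. The sketch below illustrates it; the `evaluate` callback, the parameter names as dictionary keys and the candidate grid are all assumptions for the example, and only the initial value of 0.1 and the tuning order come from the text.

```python
from typing import Callable, Dict, Sequence


def select_parameters(
    evaluate: Callable[[Dict[str, float]], float],
    order: Sequence[str],
    grid: Sequence[float],
    initial: float = 0.1,
) -> Dict[str, float]:
    """Tune parameters one at a time, in `order`.

    While one parameter is swept over `grid`, the others stay fixed at their
    current values. Once its best value is found, that value replaces the
    initial one before the next parameter is tuned.
    """
    params = {name: initial for name in order}
    for name in order:
        best_value, best_score = params[name], float("-inf")
        for candidate in grid:
            trial = dict(params)
            trial[name] = candidate
            score = evaluate(trial)  # higher is better, e.g. validation AUC
            if score > best_score:
                best_value, best_score = candidate, score
        params[name] = best_value
    return params
```

This costs one model evaluation per grid point per parameter, instead of one per point of the full Cartesian grid, at the price of possibly missing interactions between parameters.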

The Influences of Parameters to Zero-Shot Learning 

Figure 8. The influence of $\gamma$ on the attribute prediction accuracies: (a) AWA, (b) USAA, (c) CUB. 

In Zero-Shot Learning (ZSL), we need to employ the sigmoid function to normalize the attribute confidences 
obtained by our models into the range [0, 1]. So, there is one additional parameter, $\rho$, that should be studied in this section. We follow 
the aforementioned parameter selection procedure. The selection starts from $\mu$ and proceeds to $\rho$, where 
the initial value of $\rho$ is 0.5. Figure 9 shows the ZSL accuracies under different $\mu$. On the AWA and CUB databases, all three 
approaches achieve their best performance when $\mu = 0.1$, while the optimal value of $\mu$ on the USAA database is 1. Compared 
with other parameters, the HAP algorithms are relatively insensitive to $\mu$ when its value is bigger than 1. Figures 10 and 11 


demonstrate the impacts of $\lambda$ and $\eta$ on the ZSL accuracies respectively. The curves in these figures behave similarly in that 
their peaks are very distinct. From the observations, the optimal values of $\lambda$ are 1, 0.01 and 0.1 on the AWA, 
USAA and CUB databases respectively, while the corresponding values of $\eta$ are 1, 0.1 and 0.1. As in 
Figure 8, Figure 12 also shows that CSHAP_G is very sensitive to $\gamma$ while CSHAP_H is robust to $\gamma$. Here, we suggest setting 
the $\gamma$ of the AWA, USAA and CUB databases to 1, $10^{-?}$ and $10^{-?}$ respectively. Figure 13 reports the ZSL performances under 
different $\rho$. However, it is hard to settle on a uniform setting for each database, so we choose different $\rho$ for different 
approaches. More specifically, on the AWA database we suggest choosing $\rho$ in the range [0.007, 0.02] for CSHAP_H and in the 
range [0.06, 0.1] for HAP and CSHAP_G. On the USAA database, CSHAP_G achieves good ZSL performance 
when $\rho$ is in the range [0.005, 0.01], while a good $\rho$ for CSHAP_H and HAP should be above 0.3. The impacts of $\rho$ on the 
performances of all three algorithms are similar on the CUB database: the observations indicate that a $\rho$ larger than 
1 yields good performance for all three HAP algorithms. 
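
The sigmoid normalization of attribute confidences mentioned in this section could look like the sketch below. How $\rho$ enters the sigmoid is not stated in this text; treating it as a slope that multiplies the raw score is an assumption, as is the function name.

```python
import math
from typing import List, Sequence


def sigmoid_normalize(scores: Sequence[float], rho: float = 0.5) -> List[float]:
    """Map raw attribute confidences into [0, 1] with a sigmoid.

    Assumption: rho acts as the slope, i.e. sigma(rho * s). The default of
    0.5 mirrors the initial value used in the parameter selection.
    """
    out = []
    for s in scores:
        x = rho * s
        if x >= 0:
            out.append(1.0 / (1.0 + math.exp(-x)))
        else:
            # Equivalent form that avoids overflow for very negative x
            z = math.exp(x)
            out.append(z / (1.0 + z))
    return out
```

Under this assumption a larger $\rho$ pushes confidences toward 0 or 1, while a smaller $\rho$ keeps them near 0.5.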



Figure 9. The influence of $\mu$ on the ZSL accuracies: (a) AWA, (b) USAA, (c) CUB. 

Figure 10. The influence of $\lambda$ on the ZSL accuracies: (b) USAA. 

Figure 11. The influence of $\eta$ on the ZSL accuracies. 

Figure 12. The influence of $\gamma$ on the ZSL accuracies: (a) AWA, (b) USAA, (c) CUB. 

Figure 13. The influence of $\rho$ on the ZSL accuracies: (b) USAA. 